Abstract. We survey a new area of parameter-free similarity
distance measures useful in data-mining, pattern recognition,
learning and automatic semantics extraction. Given a
family of distances on a set of objects, a distance is universal 
up to a certain precision for that family if it minorizes every 
distance in the family between every two objects in the set, up 
to the stated precision (we do not require the universal dis- 
tance to be an element of the family). We consider similarity 
distances for two types of objects: literal objects that as such 
contain all of their meaning, like genomes or books, and names 
for objects. The latter may have literal embodiments like the
first type, but may also be abstract like "red" or "Christian- 
ity." For the first type we consider a family of computable dis- 
tance measures corresponding to parameters expressing simi- 
larity according to particular features between pairs of literal 
objects. For the second type we consider similarity distances 
generated by web users corresponding to particular semantic 
relations between the (names for) the designated objects. For 
both families we give universal similarity distance measures, 
incorporating all particular distance measures in the family. 
In the first case the universal distance is based on compression 
and in the second case it is based on Google page counts re- 
lated to search terms. In both cases experiments on a massive 
scale give evidence of the viability of the approaches. 

I. Introduction 

Objects can be given literally, like the literal four-letter genome 
of a mouse, or the literal text of War and Peace by Tolstoy. For 
simplicity we take it that all meaning of the object is represented 
by the literal object itself. Objects can also be given by name, like 
"the four-letter genome of a mouse," or "the text of War and Peace 
by Tolstoy." There are also objects that cannot be given literally, 
but only by name and acquire their meaning from their contexts 
in background common knowledge in humankind, like "home" or 
"red." In the literal setting, objective similarity of objects can be 
established by feature analysis, one type of similarity per feature. 
In the abstract "name" setting, all similarity must depend on back- 
ground knowledge and common semantics relations, which is in- 
herently subjective and "in the mind of the beholder." 

II. Compression Based Similarity 

All data are created equal but some data are more alike than others. 
We have recently proposed methods expressing this alikeness, us- 
ing a new similarity metric based on compression. It is parameter- 
free in that it doesn't use any features or background knowledge 
about the data, and can without changes be applied to different 
areas and across area boundaries. It is universal in that it approxi- 
mates the parameter expressing similarity of the dominant feature 
in all pairwise comparisons. It is robust in the sense that its success
appears independent of the type of compressor used. The clustering
we use is hierarchical clustering in dendrograms based on a
new fast heuristic for the quartet method. The method is available
as an open-source software tool [5].

Feature-Based Similarities: We are presented with unknown 
data and the question is to determine the similarities among them 
and group like with like together. Commonly, the data are of a 
certain type: music files, transaction records of ATM machines, 
credit card applications, genomic data. In these data there are hid- 
den relations that we would like to get out in the open. For exam- 
ple, from genomic data one can extract letter- or block frequencies 
(the blocks are over the four-letter alphabet); from music files one 
can extract various specific numerical features, related to pitch, 
rhythm, harmony etc. One can extract such features using, for instance,
Fourier transforms [25] or wavelet transforms [13], to quantify
parameters expressing similarity. The resulting vectors corresponding
to the various files are then classified or clustered using
existing classification software, based on various standard statistical
pattern recognition classifiers [25], Bayesian classifiers [11],
hidden Markov models [6], ensembles of nearest-neighbor classifiers
[13] or neural networks [11], [23]. For example, in music
one feature would be to look for rhythm in the sense of beats per 
minute. One can make a histogram where each histogram bin corresponds
to a particular tempo in beats-per-minute and the associated
peak shows how frequent and strong that particular periodicity
was over the entire piece. In [25] we see a gradual change from
a few high peaks to many low and spread-out ones going from
hip-hop, rock, jazz, to classical. One can use this similarity type to try
to cluster pieces in these categories. However, such a method re- 
quires specific and detailed knowledge of the problem area, since 
one needs to know what features to look for. 

Non-Feature Similarities: Our aim is to capture, in a single 
similarity metric, every effective distance: effective versions of 
Hamming distance, Euclidean distance, edit distances, alignment 
distance, Lempel-Ziv distance, and so on. This metric should be 
so general that it works in every domain: music, text, literature, 
programs, genomes, executables, natural language determination, 
equally and simultaneously. It would be able to simultaneously 
detect all similarities between pieces that other effective distances 
can detect separately.

Such a "universal" metric was co-developed by us in [18], [19],
as a normalized version of the "information metric" of [20], [1].
Roughly speaking, two objects are deemed close if we can significantly
"compress" one given the information in the other, the
idea being that if two pieces are more similar, then we can more
succinctly describe one given the other. The mathematics used is
based on Kolmogorov complexity theory [20]. In [19] we defined
a new class of (possibly non-metric) distances, taking values in
[0, 1] and appropriate for measuring effective similarity relations
between sequences, say one type of similarity per distance, and
vice versa. It was shown that an appropriately "normalized" information
distance minorizes every distance in the class. It discovers
all effective similarities in the sense that if two objects are close
according to some effective similarity, then they are also close according
to the normalized information distance. Put differently,
the normalized information distance represents similarity according
to the dominating shared feature between the two objects being
compared. In comparisons of more than two objects, different
pairs may have different dominating features. For every two objects,
this universal metric distance zooms in on the dominant similarity
between those two objects out of a wide class of admissible
similarity features. In [19] we proved its optimality and universality.
The normalized information distance also satisfies the metric
(in)equalities, and takes values in [0, 1]; hence it may be called
"the" similarity metric.

Normalized Compression Distance: Unfortunately, the uni- 
versality of the normalized information distance comes at the price 
of noncomputability, since it is based on the uncomputable notion 
of Kolmogorov complexity. But since the Kolmogorov complexity 
of a string or file is the length of the ultimate compressed version 
of that file, we can use real data compression programs to approx- 
imate the Kolmogorov complexity. Therefore, to apply this ideal 
precise mathematical theory in real life, we have to replace the use 
of the noncomputable Kolmogorov complexity by an approxima- 
tion using a standard real-world compressor. Thus, if C is a com- 
pressor and we use C(x) to denote the length of the compressed 
version of a string x, then we arrive at the Normalized Compres- 
sion Distance: 



\[ \mathrm{NCD}(x,y) = \frac{C(xy) - \min(C(x), C(y))}{\max(C(x), C(y))} \qquad (1) \]



where for convenience we have replaced the pair (x,y) in the formula
by the concatenation xy, see [19], [9]. In [9] we propose axioms
to capture the real-world setting, and show that the NCD approximates
optimality. Actually, the NCD is a family of compression
functions parameterized by the given data compressor C.
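
As a rough illustration of (1), the following Python sketch approximates C with the standard-library zlib compressor. The choice of zlib, the helper names compressed_len and ncd, and the sample inputs are assumptions made for this example only; they are not the tooling used in the reported experiments.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    """Stand-in for C(x): length in bytes of the zlib-compressed input."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance, with the pair (x, y) encoded as xy."""
    cx, cy = compressed_len(x), compressed_len(y)
    cxy = compressed_len(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Illustrative inputs only: a pseudo-random word sequence and pure noise.
rng = random.Random(0)
vocabulary = ["genome", "music", "text", "code", "image", "rhythm", "tree", "data"]
text = " ".join(rng.choice(vocabulary) for _ in range(2000)).encode("ascii")
noise = bytes(rng.randrange(256) for _ in range(len(text)))

print("NCD(text, text)  =", round(ncd(text, text), 3))   # should be small
print("NCD(text, noise) =", round(ncd(text, noise), 3))  # should be close to 1
```

Because a real compressor only approximates Kolmogorov complexity, NCD(x, x) is small rather than exactly 0, and values slightly above 1 can occur.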

Universality of NCD: In [9] we prove that the NCD is universal
with respect to the family of all admissible normalized
distances, a special class that is argued to contain all parameters
and features of similarity that are effective. The compression-based
NCD method establishes a universal similarity metric (1)
among objects given as finite binary strings [18], [19], [5], [14],
and has been applied to objects like genomes, music pieces in
MIDI format, computer programs in Ruby or C, pictures in simple
bitmap formats, or time sequences such as heart rhythm data,
heterogeneous data and anomaly detection. This method is feature-free
in the sense that it doesn't analyze the files looking for particular 
features; rather it analyzes all features simultaneously and deter- 
mines the similarity between every pair of objects according to the 
most dominant shared feature. The crucial point is that the method 
analyzes the objects themselves. This precludes comparison of ab- 
stract notions or other objects that don't lend themselves to direct 
analysis, like emotions, colors, Socrates, Plato, Mike Bonanno and 
Albert Einstein. 

III. Google-Based Similarity 

To make computers more intelligent one would like to represent
meaning in computer-digestible form. Long-term and labor-intensive
efforts like the Cyc project [17] and the WordNet project
[24] try to establish semantic relations between common objects,
or, more precisely, names for those objects. The idea is to create 
a semantic web of such vast proportions that rudimentary intelli- 
gence and knowledge about the real world spontaneously emerges. 
This comes at the great cost of designing structures capable of ma- 
nipulating knowledge, and entering high quality contents in these 
structures by knowledgeable human experts. While the efforts are 
long-running and large scale, the overall information entered is 
minute compared to what is available on the world-wide-web. 

The rise of the world-wide-web has enticed millions of users to 
type in trillions of characters to create billions of web pages of on 
average low quality contents. The sheer mass of the information 
available about almost every conceivable topic makes it likely that 
extremes will cancel and the majority or average is meaningful 
in a low-quality approximate sense. We devise a general method 
to tap the amorphous low-grade knowledge available for free on 
the world-wide-web, typed in by local users aiming at personal 
gratification of diverse objectives, and yet globally achieving what 
is effectively the largest semantic electronic database in the world. 
Moreover, this database is available for all by using any search 
engine that can return aggregate page-count estimates like Google 
for a large range of search-queries. 

While the previous NCD method that compares the objects
themselves using (1) is particularly suited to obtain knowledge
about the similarity of objects themselves, irrespective of common
beliefs about such similarities, we now develop a method that
uses only the name of an object and obtains knowledge about the
similarity of objects by tapping available information generated
by multitudes of web users. Here we are reminded of the words
of D.H. Rumsfeld [22]: "A trained ape can know an awful lot/ Of
what is going on in this world,/ Just by punching on his mouse/
For a relatively modest cost!" The new method is useful to extract
knowledge from a given corpus of knowledge, in this case the
Google database, but not to obtain true facts that are not common 
knowledge in that database. For example, common viewpoints 
on the creation myths in different religions may be extracted by 
the Googling method, but contentious questions of fact concern- 
ing the phylogeny of species can be better approached by using 
the genomes of these species, rather than by opinion. 

Googling for Knowledge: Let us start with a simple intuitive
justification (not to be mistaken for a substitute of the underlying
mathematics) of the approach we propose in [10]. The Google
search engine indexes around ten billion pages on the web today. 
Each such page can be viewed as a set of index terms. A search 
for a particular index term, say "horse", returns a certain number 
of hits (web pages where this term occurred), say 46,700,000. The 
number of hits for the search term "rider" is, say, 12,200,000. It 
is also possible to search for the pages where both "horse" and 
"rider" occur. This gives, say, 2,630,000 hits. This can be easily 
put in the standard probabilistic framework. If $w$ is a web page and
$x$ a search term, then we write $x \in w$ to mean that Google returns
web page $w$ when presented with search term $x$. An event is a set
of web pages returned by Google after it has been presented with a
search term. We can view the event as the collection of all contexts
of the search term, background knowledge, as induced by the
accessible web pages for the Google search engine. If the search
term is $x$, then we denote the event by $\mathbf{x}$, and define
$\mathbf{x} = \{w : x \in w\}$. The probability $p(x)$ of an event
$\mathbf{x}$ is the number of web pages in the event divided by the
overall number $M$ of web pages possibly returned by Google. Thus,
$p(x) = |\mathbf{x}|/M$. At the time of writing,
Google searches 8,058,044,651 web pages. Define the joint event
$\mathbf{x} \cap \mathbf{y} = \{w : x, y \in w\}$ as the set of web
pages returned by Google,
containing both the search term $x$ and the search term $y$. The joint
probability $p(x,y) = |\{w : x,y \in w\}|/M$ is the number of web pages
in the joint event divided by the overall number $M$ of web pages
possibly returned by Google. This notation also allows us to define
the probability $p(x|y)$ of conditional events
$\mathbf{x}|\mathbf{y} = (\mathbf{x} \cap \mathbf{y})/\mathbf{y}$,
defined by $p(x|y) = p(x,y)/p(y)$.

In the above example we have therefore $p(\text{horse}) \approx 0.0058$,
$p(\text{rider}) \approx 0.0015$, $p(\text{horse}, \text{rider}) \approx 0.0003$.
We conclude that the probability $p(\text{horse}|\text{rider})$ of
"horse" accompanying "rider" is $\approx 1/5$ and the probability
$p(\text{rider}|\text{horse})$ of "rider" accompanying
"horse" is $\approx 1/19$. The probabilities are asymmetric, and it
is the least probability that is the significant one. A very general
search term like "the" occurs in virtually all (English language)
web pages. Hence $p(\text{the}|\text{rider}) \approx 1$, and for almost
all search terms $x$ we have $p(\text{the}|x) \approx 1$. But
$p(\text{rider}|\text{the}) \ll 1$, say about
equal to $p(\text{rider})$, and gives the relevant information about the
association of the two terms.
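
The arithmetic behind these figures can be checked with a few lines of Python. This is only a sketch using the page counts quoted in the text; the variable names are chosen for this example.

```python
# Page counts quoted in the text (Google, at the time of writing).
M = 8_058_044_651        # total number of indexed pages
f_horse = 46_700_000     # hits for "horse"
f_rider = 12_200_000     # hits for "rider"
f_both = 2_630_000       # hits for "horse" and "rider" together

p_horse = f_horse / M
p_rider = f_rider / M
p_both = f_both / M

# Conditional probabilities p(x|y) = p(x,y) / p(y); note that M cancels.
p_horse_given_rider = p_both / p_rider
p_rider_given_horse = p_both / p_horse

print(f"p(horse)        = {p_horse:.4f}")
print(f"p(rider)        = {p_rider:.4f}")
print(f"p(horse, rider) = {p_both:.4f}")
print(f"p(horse|rider)  = {p_horse_given_rider:.3f}")
print(f"p(rider|horse)  = {p_rider_given_horse:.3f}")
```

With the unrounded counts, p(rider|horse) comes out near 0.056, that is, closer to 1/18; the figure 1/19 results from dividing the rounded values 0.0003 and 0.0058.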

Our first attempt therefore could be the distance 



\[ D_1(x,y) = \min\{p(x|y), p(y|x)\}. \]



Experimenting with this distance gives bad results. One reason is
that the differences among small probabilities have increasing
significance the smaller the probabilities involved are. Another
reason is that we deal with absolute probabilities: two notions
that have very small probabilities each and have $D_1$-distance
$\epsilon$ are much less similar than two notions that have much
larger probabilities and have the same $D_1$-distance. To resolve
the first problem we take the negative logarithm of the items being
minimized, resulting in

\[ D_2(x,y) = \max\{\log 1/p(x|y), \log 1/p(y|x)\}. \]

To resolve the second problem we normalize $D_2(x,y)$ by dividing
by the maximum of $\log 1/p(x)$, $\log 1/p(y)$. Altogether, we obtain
the following normalized distance



\[ D_3(x,y) = \frac{\max\{\log 1/p(x|y), \log 1/p(y|x)\}}{\max\{\log 1/p(x), \log 1/p(y)\}} \]

for $p(x|y) > 0$ (and hence $p(y|x) > 0$), and $D_3(x,y) = \infty$
for $p(x|y) = 0$ (and hence $p(y|x) = 0$). Note that $p(x|y) =
p(x,y)/p(y) = 0$ means that the search terms "x" and "y" never
occur together. The two conditional probabilities are either both 0
or they are both strictly positive. Moreover, if either of $p(x), p(y)$
is 0, then so are the conditional probabilities, but not necessarily
vice versa.
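
To make the three candidate distances concrete, the sketch below evaluates $D_1$, $D_2$ and $D_3$ for the horse and rider counts quoted earlier. The natural logarithm is an assumption of this example: the value of $D_2$ depends on the base, while the ratio $D_3$ does not.

```python
import math

M = 8_058_044_651
f_x, f_y, f_xy = 46_700_000, 12_200_000, 2_630_000  # horse, rider, both

p_x, p_y = f_x / M, f_y / M
p_x_given_y = f_xy / f_y   # p(horse|rider)
p_y_given_x = f_xy / f_x   # p(rider|horse)

# First attempt: the smaller of the two conditional probabilities.
d1 = min(p_x_given_y, p_y_given_x)

# Second attempt: negative logarithms of the items being minimized.
d2 = max(math.log(1 / p_x_given_y), math.log(1 / p_y_given_x))

# Third attempt: normalize by the larger of log 1/p(x), log 1/p(y).
d3 = d2 / max(math.log(1 / p_x), math.log(1 / p_y))

print(f"D1 = {d1:.4f}, D2 = {d2:.4f}, D3 = {d3:.4f}")
```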

We note that in the conditional probabilities the total number
$M$ of web pages indexed by Google is divided out. Therefore,
the conditional probabilities are independent of $M$, and can be
replaced by the number of pages, the frequency, returned by Google.
Define the frequency $f(x)$ of search term $x$ as the number of pages
a Google search for $x$ returns: $f(x) = Mp(x)$, $f(x,y) = Mp(x,y)$,
and $p(x|y) = f(x,y)/f(y)$. Rewriting $D_3$ results in our final notion,
the normalized Google distance (NGD), defined by



\[ \mathrm{NGD}(x,y) = \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}} \qquad (2) \]



and if $f(x), f(y) > 0$ and $f(x,y) = 0$ then $\mathrm{NGD}(x,y) = \infty$. From
(2) we see that

1. $\mathrm{NGD}(x,y)$ is undefined for $f(x) = f(y) = 0$;

2. $\mathrm{NGD}(x,y) = \infty$ for $f(x,y) = 0$ and either or both $f(x) > 0$
and $f(y) > 0$; and

3. $\mathrm{NGD}(x,y) \geq 0$ otherwise.
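
The rewriting of $D_3$ into (2) is only stated above. Using $p(x) = f(x)/M$ and $p(x|y) = f(x,y)/f(y)$, the routine steps can be spelled out as follows (added here for convenience):

\[
\begin{aligned}
\max\{\log 1/p(x|y), \log 1/p(y|x)\}
  &= \max\{\log f(y) - \log f(x,y),\ \log f(x) - \log f(x,y)\} \\
  &= \max\{\log f(x), \log f(y)\} - \log f(x,y), \\
\max\{\log 1/p(x), \log 1/p(y)\}
  &= \max\{\log M - \log f(x),\ \log M - \log f(y)\} \\
  &= \log M - \min\{\log f(x), \log f(y)\}.
\end{aligned}
\]

Dividing the first expression by the second gives (2).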

With the Google hit numbers above, we can now compute 

\[ \mathrm{NGD}(\text{horse}, \text{rider}) \approx 0.443. \]

We did the same calculation when Google indexed only one-half
of the current number of pages: 4,285,199,774. It is instructive
that the probabilities of the used search terms didn't change
significantly over this doubling of pages, with the number of hits for
"horse" equal to 23,700,000, for "rider" equal to 6,270,000, and for
"horse, rider" equal to 1,180,000. The NGD(horse, rider) we
computed in that situation was 0.460. This is in line with our contention
that the relative frequencies of web pages containing search
terms give objective information about the semantic relations between
the search terms. If this is the case, then with the vastness
of the information accessed by Google, the Google probabilities
of search terms, and the computed NGDs, should stabilize (be
scale invariant) with a growing Google database.
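
The two horse and rider values can be recomputed directly from the quoted page counts. The function below is a minimal sketch of formula (2) including its special cases; the name ngd and its argument order are assumptions of this example, and no search engine is queried.

```python
import math


def ngd(f_x: int, f_y: int, f_xy: int, m: int) -> float:
    """Normalized Google distance from page counts, following formula (2)."""
    if f_x == 0 and f_y == 0:
        raise ValueError("NGD is undefined when f(x) = f(y) = 0")
    if f_xy == 0:
        return math.inf  # the terms never occur together
    log_x, log_y, log_xy = math.log(f_x), math.log(f_y), math.log(f_xy)
    return (max(log_x, log_y) - log_xy) / (math.log(m) - min(log_x, log_y))


# Counts at the time of writing, and at roughly half that index size.
ngd_now = ngd(46_700_000, 12_200_000, 2_630_000, 8_058_044_651)
ngd_half = ngd(23_700_000, 6_270_000, 1_180_000, 4_285_199_774)

print(f"NGD(horse, rider), full index: {ngd_now:.3f}")
print(f"NGD(horse, rider), half index: {ngd_half:.3f}")
```

The base of the logarithm cancels in the ratio, so natural logarithms are used here without loss of generality.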

The NGD formula itself is scale-invariant. It is very important
that, if the number $M$ of pages indexed by Google grows
sufficiently large, the number of pages containing given search
terms goes to a fixed fraction of $M$, and so does the number of
pages containing conjunctions of search terms. This means that if
$M$ doubles, then so do the $f$-frequencies. For the NGD to give
us an objective semantic relation between search terms, it needs to
become stable when the number $M$ of indexed pages grows. Some
evidence that this actually happens is given by the horse and rider
computation above, where the NGD scales properly.
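
The scale invariance can be made explicit. Suppose, as an idealization of the fixed-fraction assumption, that $M$ and all frequencies are multiplied by the same factor $c > 0$. Then the term $\log c$ cancels in both numerator and denominator:

\[
\frac{\max\{\log cf(x), \log cf(y)\} - \log cf(x,y)}{\log cM - \min\{\log cf(x), \log cf(y)\}}
= \frac{\max\{\log f(x), \log f(y)\} - \log f(x,y)}{\log M - \min\{\log f(x), \log f(y)\}}.
\]

In practice the frequencies only scale approximately, which is why the two computed horse and rider values are close but not identical.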

IV. From NCD to NGD

The Google Distribution: Let the set of singleton Google search
terms be denoted by $S$. In the sequel we use both singleton search
terms and doubleton search terms $\{\{x,y\} : x,y \in S\}$. Let the set
of web pages indexed (possibly being returned) by Google be
$\Omega$. The cardinality of $\Omega$ is denoted by $M = |\Omega|$, and
currently $8 \cdot 10^9 \leq M \leq 9 \cdot 10^9$. Assume that a priori
all web pages are equiprobable, with the probability of being returned
by Google being $1/M$. A subset of $\Omega$ is called an event. Every
search term $x$ usable by Google defines a singleton Google event
$\mathbf{x} \subseteq \Omega$ of web pages
that contain an occurrence of $x$ and are returned by Google if we
do a search for $x$. Let $L : \Omega \to [0,1]$ be the uniform mass
probability function. The probability of such an event $\mathbf{x}$ is
$L(\mathbf{x}) = |\mathbf{x}|/M$.
Similarly, the doubleton Google event
$\mathbf{x} \cap \mathbf{y} \subseteq \Omega$ is the set of web
pages returned by Google if we do a search for pages containing
both search term $x$ and search term $y$. The probability of this event
is $L(\mathbf{x} \cap \mathbf{y}) = |\mathbf{x} \cap \mathbf{y}|/M$.
We can also define the other Boolean combinations:
$\neg\mathbf{x} = \Omega \backslash \mathbf{x}$ and
$\mathbf{x} \cup \mathbf{y} = \neg(\neg\mathbf{x} \cap \neg\mathbf{y})$, each such
event having a probability equal to its cardinality divided by $M$.
If $\mathbf{e}$ is an event obtained from the basic events
$\mathbf{x}, \mathbf{y}, \ldots$, corresponding
to basic search terms $x, y, \ldots$, by finitely many applications of
the Boolean operations, then the probability $L(\mathbf{e}) = |\mathbf{e}|/M$.
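
These definitions map directly onto set operations. The following sketch uses a hypothetical five-page index (the dictionary pages and the helper names event and L are invented for illustration) to show singleton events, a doubleton event, and the Boolean combinations.

```python
# Hypothetical toy index: page ids and the search terms each page contains.
pages = {
    0: {"horse", "rider"},
    1: {"horse"},
    2: {"rider", "saddle"},
    3: {"horse", "rider", "saddle"},
    4: {"saddle"},
}
omega = set(pages)   # the set of all indexed pages
M = len(omega)


def event(term: str) -> set:
    """Singleton Google event: pages containing an occurrence of the term."""
    return {w for w, terms in pages.items() if term in terms}


def L(e: set) -> float:
    """Uniform probability of an event: its cardinality divided by M."""
    return len(e) / M


horse, rider = event("horse"), event("rider")
joint = horse & rider                                  # doubleton event
not_horse = omega - horse                              # complement
union = omega - ((omega - horse) & (omega - rider))    # via De Morgan

print("L(horse) =", L(horse), " L(horse and rider) =", L(joint))
print("union equals horse | rider:", union == horse | rider)
```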

Google events capture in a particular sense all background
knowledge about the search terms concerned available (to Google)
on the web. Therefore, it is natural to consider code words for
those events as coding this background knowledge. However, we
cannot use the probability of the events directly to determine a
prefix code such as the Shannon-Fano code [20]. The reason
is that the events overlap and hence the summed probability exceeds
1. By the Kraft inequality [20] this prevents a corresponding
Shannon-Fano code. The solution is to normalize: We use the
probability of the Google events to define a probability mass function
over the set $\{\{x,y\} : x,y \in S\}$ of Google search terms, both
singleton and doubleton. Define

\[ N = \sum_{\{x,y\} \subseteq S} |\mathbf{x} \cap \mathbf{y}|, \]

counting each singleton set and each doubleton set (by definition
unordered) once in the summation. Since every web page that is
indexed by Google contains at least one occurrence of a search
term, we have $N \geq M$. On the other hand, web pages contain on
average not more than a certain constant $\alpha$ search terms. Therefore,
$N \leq \alpha M$. Define



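\[
\begin{aligned}
g(x) &= L(\mathbf{x}) M/N = |\mathbf{x}|/N, \\
g(x,y) &= L(\mathbf{x} \cap \mathbf{y}) M/N = |\mathbf{x} \cap \mathbf{y}|/N.
\end{aligned}
\qquad (3)
\]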



Then, $\sum_{x \in S} g(x) + \sum_{x,y \in S} g(x,y) = 1$. Note that
$g(x,y)$ is not a conventional joint distribution since possibly
$g(x) \neq \sum_{y \in S} g(x,y)$.
Rather, we consider $g$ to be a probability mass function over the
sample space $\{\{x,y\} : x,y \in S\}$. This $g$-distribution changes over
time, and between different samplings from the distribution. But
let us imagine that $g$ holds in the sense of an instantaneous snapshot.
The real situation will be an approximation of this. Given the
Google machinery, these are absolute probabilities which allow us
to define the associated Shannon-Fano code for both the singletons
and the doubletons.
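
A small numerical sketch may help to see why the normalization by N is needed. The four-page index below is hypothetical, and exact fractions are used so that the sums can be checked without rounding.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical toy index: each page is the set of search terms it contains.
pages = [
    {"horse", "rider"},
    {"horse"},
    {"rider", "saddle"},
    {"horse", "rider", "saddle"},
]
M = len(pages)
S = sorted(set().union(*pages))   # the singleton search terms


def count(*terms: str) -> int:
    """Number of pages containing all the given terms."""
    return sum(1 for page in pages if all(t in page for t in terms))


singles = {(x,): count(x) for x in S}
doubles = {(x, y): count(x, y) for x, y in combinations(S, 2)}

# The raw event probabilities |x|/M overlap, so they sum to more than 1.
raw_total = sum(Fraction(c, M) for c in singles.values())

# N counts every singleton and every unordered doubleton exactly once.
N = sum(singles.values()) + sum(doubles.values())
g = {key: Fraction(c, N) for key, c in {**singles, **doubles}.items()}

print("sum of |x|/M over singletons:", raw_total)
print("N =", N, " M =", M, " sum of g =", sum(g.values()))

# g is not a conventional joint distribution: compare g(rider) with the
# sum of g(rider, y) over the other terms y.
marginal = sum(v for key, v in g.items() if len(key) == 2 and "rider" in key)
print("g(rider) =", g[("rider",)], " sum_y g(rider, y) =", marginal)
```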

Normalized Google Distance: The Google code length $G$ is
defined by

\[
\begin{aligned}
G(x) &= \log 1/g(x), \\
G(x,y) &= \log 1/g(x,y).
\end{aligned}
\qquad (4)
\]



In contrast to strings $x$ where the complexity $C(x)$ represents the
length of the compressed version of $x$ using compressor $C$, for
a search term $x$ (just the name for an object rather than the object
itself), the Google code of length $G(x)$ represents the shortest
expected prefix-code word length of the associated Google event
$\mathbf{x}$. The expectation is taken over the Google distribution $g$. In
this sense we can use the Google distribution as a compressor for
Google "meaning" associated with the search terms. The associated
NCD, now called the normalized Google distance (NGD),
is then defined by (2) with $N$ substituted for $M$, rewritten as



\[ \mathrm{NGD}(x,y) = \frac{G(x,y) - \min(G(x), G(y))}{\max(G(x), G(y))} \qquad (5) \]



This NGD is an approximation to the normalized information distance
(NID) using the Shannon-Fano code (Google code) generated by the
Google distribution as defining a compressor approximating the length
of the Kolmogorov code, using the background knowledge on the web as
viewed by Google as conditional information. In experimental
practice, we consider $N$ (or $M$) as a normalization constant that
can be adjusted.
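
That the code-length form (5) coincides with the frequency form (2) once the same normalization constant is used in both can be checked numerically. The sketch below does so for the horse and rider counts; the function names are assumptions of this example, and the normalization constant is simply passed in as a parameter.

```python
import math


def ngd_from_counts(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Formula (2), with n playing the role of the normalization constant."""
    log_x, log_y = math.log(f_x), math.log(f_y)
    return (max(log_x, log_y) - math.log(f_xy)) / (math.log(n) - min(log_x, log_y))


def google_code_length(count: int, n: int) -> float:
    """G = log 1/g, where g = count / n as in (3) and (4)."""
    return math.log(n / count)


def ngd_from_codes(f_x: int, f_y: int, f_xy: int, n: int) -> float:
    """Formula (5), written in terms of Google code lengths."""
    g_x = google_code_length(f_x, n)
    g_y = google_code_length(f_y, n)
    g_xy = google_code_length(f_xy, n)
    return (g_xy - min(g_x, g_y)) / max(g_x, g_y)


counts = (46_700_000, 12_200_000, 2_630_000)   # horse, rider, both
M = 8_058_044_651

for n in (M, 2 * M):   # two illustrative choices of the constant
    a = ngd_from_counts(*counts, n)
    b = ngd_from_codes(*counts, n)
    print(f"n = {n}: formula (2) gives {a:.4f}, formula (5) gives {b:.4f}")
```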

Universality of NGD: In the full paper [10] we show that (2)
and (5) are close in typical situations. Our experimental results
suggest that every reasonable (greater than any $f(x)$) value can be
used for the normalizing factor $N$, and our results seem in general
insensitive to this choice. In our software, this parameter $N$ can be
adjusted as appropriate, and we often use $M$ for $N$. In the full paper
we analyze the mathematical properties of NGD, and prove the
universality of the Google distribution among web author based
distributions, as well as the universality of the NGD with respect
to the family of the individual web authors' NGDs, that
is, their individual semantics relations (with high probability);
these results are not included here for space reasons.

V. Applications 

Applications of NCD: We developed the CompLearn Toolkit [5],
and performed experiments in vastly different application fields to
test the quality and universality of the method. The success of the
method as reported below depends strongly on the judicious use
of encoding of the objects compared. Here one should use common
sense on what a real world compressor can do. There are
situations where our approach fails if applied in a straightforward
way. For example: comparing text files by the same authors in different
encodings (say, Unicode and 8-bit version) is bound to fail.
For the ideal similarity metric based on Kolmogorov complexity
as defined in [19] this does not matter at all, but for practical compressors
used in the experiments it will be fatal. Similarly, in the
music experiments below we use symbolic MIDI music file format
rather than wave format music files. The reason is that the
strings resulting from straightforward discretizing the wave form
files may be too sensitive to how we discretize. Further research
may overcome this problem.
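
The encoding pitfall is easy to reproduce. In the sketch below the same text is encoded once as UTF-8 and once as UTF-16; zlib stands in for a practical compressor, and the word list and helper names are made up for this example. The identical content looks far apart to the compressor once the byte-level encodings differ.

```python
import random
import zlib


def compressed_len(data: bytes) -> int:
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    cx, cy = compressed_len(x), compressed_len(y)
    return (compressed_len(x + y) - min(cx, cy)) / max(cx, cy)


# Illustrative text: a pseudo-random sequence of a few words.
rng = random.Random(1)
words = ["war", "peace", "prince", "battle", "moscow", "winter", "letter", "army"]
text = " ".join(rng.choice(words) for _ in range(1000))

as_utf8 = text.encode("utf-8")        # one byte per character here
as_utf16 = text.encode("utf-16-le")   # two bytes per character

same_encoding = ncd(as_utf8, as_utf8)
mixed_encoding = ncd(as_utf8, as_utf16)

print(f"NCD, same text, same encoding:       {same_encoding:.3f}")
print(f"NCD, same text, different encodings: {mixed_encoding:.3f}")
```

A practical remedy, in line with the advice above, is to convert all objects to one common encoding before compressing them.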

The NCD is not restricted to a specific application area, and
works across application area boundaries. To extract a hierarchy
of clusters from the distance matrix, we determine a dendrogram
(binary tree) by a new quartet method and a fast heuristic to implement
it. The method is implemented and available as public software
[5], and is robust under choice of different compressors. This
approach gives the first completely automatic construction of the
phylogeny tree based on whole mitochondrial genomes [18], [19],
a completely automatic construction of a language tree for over
50 Euro-Asian languages [19], detects plagiarism in student programming
assignments [4], gives phylogeny of chain letters [2],
and clusters music [8]. Moreover, the method turns out to be robust
under change of the underlying compressor-types: statistical
(PPMZ), Lempel-Ziv based dictionary (gzip), block based (bzip2),
or special purpose (Gencompress).
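
The robustness claim can be probed on a small scale with the compressors shipped in the Python standard library. In this sketch zlib and bz2 stand in for gzip and bzip2, and lzma is an extra dictionary compressor not mentioned in the text; PPMZ and Gencompress are not available in the standard library and are omitted. The data and names are illustrative, and the sketch does not implement the quartet method.

```python
import bz2
import lzma
import random
import zlib

# Stand-ins for different compressor families (an assumption of this sketch).
COMPRESSORS = {
    "zlib": lambda data: zlib.compress(data, 9),
    "bz2": lambda data: bz2.compress(data, 9),
    "lzma": lzma.compress,
}


def ncd(x: bytes, y: bytes, compress) -> float:
    cx, cy = len(compress(x)), len(compress(y))
    return (len(compress(x + y)) - min(cx, cy)) / max(cx, cy)


rng = random.Random(2)
words = ["genome", "mouse", "human", "whale", "tree", "order", "branch", "root"]
original = " ".join(rng.choice(words) for _ in range(1500)).encode("ascii")

# A near-duplicate: the original with one byte overwritten every 500 bytes.
edited = bytearray(original)
for i in range(0, len(edited), 500):
    edited[i] = ord("#")
edited = bytes(edited)

# An unrelated object: pseudo-random bytes of the same length.
unrelated = bytes(rng.randrange(256) for _ in range(len(original)))

for name, compress in COMPRESSORS.items():
    near = ncd(original, edited, compress)
    far = ncd(original, unrelated, compress)
    print(f"{name:5s} NCD(original, edited) = {near:.3f}   "
          f"NCD(original, unrelated) = {far:.3f}")
```

The absolute values differ between compressors, but the near-duplicate should rank closer than the unrelated object under each of them, which is the property a clustering step relies on.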

To substantiate our claims of universality and robustness, in
[9] we report evidence of successful application in areas as diverse
as genomics, virology, languages, literature, music, handwritten
digits, astronomy, and combinations of objects from completely
different domains, using statistical, dictionary, and block sorting
compressors. In genomics we presented new evidence for major
questions in Mammalian evolution, based on whole-mitochondrial
genomic analysis: the Eutherian orders and the Marsupionta hypothesis
against the Theria hypothesis. Apart from the experiments
reported in [9], the clustering by compression method reported
in this paper has recently been used in many different areas
all over the world. One item in our group was to analyze network
traffic and cluster computer worms and viruses [26]. Finally, recent
work [14] reports experiments with our method on all time
sequence data used in all the major data-mining conferences in the
last decade. Comparing the compression method with all major
methods used in those conferences they established clear superiority
of the compression method for clustering heterogeneous data,
and for anomaly detection.

Applications of NGD: This new method is proposed in [10]
to extract semantic knowledge from the world-wide-web for both 
supervised and unsupervised learning using the Google search en- 
gine in an unconventional manner. The approach is novel in its 
unrestricted problem domain, simplicity of implementation, and 
manifestly ontological underpinnings. We give evidence of el- 
ementary learning of the semantics of concepts, in contrast to 
most prior approaches (outside of Knowledge Representation re- 
search) that have neither the appearance nor the aim of dealing 
with ideas, instead using abstract symbols that remain permanently 
ungrounded throughout the machine learning application. The 
world-wide-web is the largest database on earth, and it induces a 
probability mass function, the Google distribution, via page counts 
for combinations of search queries. This distribution allows us to 
tap the latent semantic knowledge on the web. While in the NCD
compression-based method one deals with the objects themselves,
in the current work we deal with just names for the objects. In
[10], as proof of principle, we demonstrate positive correlations,
evidencing an underlying semantic structure, in both numerical 
symbol notations and number-name words in a variety of natural 
languages and contexts. Next, we give applications in (i) unsu- 
pervised hierarchical clustering, demonstrating the ability to dis- 
tinguish between colors and numbers, and to distinguish between 
17th century Dutch painters; (ii) supervised concept-learning by 
example, using Support Vector Machines, demonstrating the abil- 
ity to understand electrical terms, religious terms, emergency inci- 
dents, and by conducting a massive experiment in understanding 
WordNet categories (7); and (iii) matching of meaning, in an ex- 
ample of automatic English-Spanish translation. 